
Implement column projection #1443

Open
gabeiglio wants to merge 9 commits into main

Conversation

@gabeiglio commented Dec 18, 2024

This is a fix for issue #1401, in which table scans needed to infer partition column values by following the column projection rules.

Fixes #1401
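Per the Iceberg spec, an identity-partitioned column may be omitted from data files, and a scan must then project its value from the file's partition metadata. A minimal pure-Python sketch of that rule (the class and function names here are illustrative, not pyiceberg's actual API):

```python
# Sketch only: stand-ins for pyiceberg's PartitionSpec/PartitionField machinery.
from dataclasses import dataclass
from typing import Any, Dict, List, Set


@dataclass
class PartitionField:
    source_id: int   # field id of the source column in the table schema
    name: str        # partition field name
    transform: str   # e.g. "identity", "bucket", "void"


def get_column_projection_values(
    spec_fields: List[PartitionField],
    partition_record: Dict[str, Any],  # partition values keyed by field name
    file_field_ids: Set[int],          # field ids physically present in the file
) -> Dict[int, Any]:
    """Map source field id -> constant value, but only for identity
    partitions whose source column is missing from the data file."""
    projected: Dict[int, Any] = {}
    for field in spec_fields:
        if field.transform == "identity" and field.source_id not in file_field_ids:
            projected[field.source_id] = partition_record[field.name]
    return projected
```

Non-identity transforms (bucket, void, etc.) are skipped because the original column value cannot be recovered from the transformed partition value.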

@Fokko Fokko self-requested a review December 18, 2024 20:38
@gabeiglio gabeiglio marked this pull request as ready for review December 19, 2024 15:12
@kevinjqliu (Contributor) left a comment

Added a few comments, please take a look! The PR looks great already. Thanks for working on this!

@kevinjqliu kevinjqliu self-requested a review December 23, 2024 19:04
…tion logic to helper method, changed test to use high-level table scan
@Fokko Fokko self-requested a review January 13, 2025 12:48
@kevinjqliu (Contributor) left a comment

Generally LGTM! I added a few nit comments and some clarifying questions on testing.

Thanks for working on this!

Comment on lines 1196 to 1199
partition_spec = PartitionSpec(
PartitionField(2, 1000, VoidTransform(), "void_partition_id"),
PartitionField(2, 1001, IdentityTransform(), "partition_id"),
)
Contributor commented:
I think we'd want to test multiple IdentityTransforms here.

I'm thinking about a case with multiple levels of hive-style partitioning:

s3://my_table/a=100/b=foo/...parquet

I think _get_column_projection_values might not support this right now.
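The multi-level hive layout described above can be sketched as follows (illustrative only, not pyiceberg code): each identity-transformed partition field contributes one `name=value` path segment, so a two-field spec maps to a two-level directory.

```python
# Illustrative sketch: a two-level hive-style layout corresponds to a
# partition record with one entry per identity-transformed field.
partition_record = {"a": 100, "b": "foo"}

# Each identity partition contributes one "name=value" path segment.
path = "/".join(f"{k}={v}" for k, v in partition_record.items())
print(path)  # a=100/b=foo
```

A projection helper therefore has to return one constant per identity field, not just a single value.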

Author (gabeiglio) replied:
Hmm, got it. I think it is supported with this new commit: before injecting a value into the RecordBatch, it now checks whether that column name is already present in the schema.
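The check described above can be sketched in pure Python, using a dict of column lists as a stand-in for a pyarrow RecordBatch (the helper name and shapes are hypothetical, not the PR's actual code):

```python
from typing import Any, Dict, List


def inject_projected_columns(
    batch_columns: Dict[str, List[Any]],  # stand-in for a RecordBatch's columns
    projected: Dict[str, Any],            # constant value per projected column name
) -> Dict[str, List[Any]]:
    """Add a constant column for each projected partition value, but only
    when the batch does not already contain a column with that name."""
    num_rows = len(next(iter(batch_columns.values())))
    out = dict(batch_columns)
    for name, value in projected.items():
        if name not in out:  # skip columns already materialized in the file
            out[name] = [value] * num_rows
    return out
```

This is what makes multi-level identity partitioning safe: columns that were physically written to the file win over projected constants.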

@kevinjqliu (Contributor) left a comment

Looks like CI caught an interesting case where a new identity partition is added after the data files were written. The accessor then cannot find the proper partition record. We need to do something like this:

)

partition_spec = PartitionSpec(
PartitionField(2, 1000, IdentityTransform(), "void_partition_id"),
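The partition-evolution case mentioned above can be handled with a null fallback; a minimal sketch, assuming a dict-shaped partition record (the helper name is hypothetical):

```python
from typing import Any, Dict, Optional


def projected_value(partition_record: Dict[str, Any], field_name: str) -> Optional[Any]:
    """Look up an identity partition value; files written before the
    partition field was added have no entry, so project None (null)."""
    return partition_record.get(field_name)
```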
Contributor commented:
nit: avoid using "void" in field names, since it's the name of a transform type: https://iceberg.apache.org/spec/#partition-transforms

partition_id: int64
----
other_field: [["foo"]]
partition_id: [[1]]"""
Contributor commented:
shouldn't this project void_partition_id=12 as well?

)


def test_identity_transform_columns_projection(tmp_path: str, catalog: InMemoryCatalog) -> None:
Contributor commented:
I was thinking we could test something like 3 fields where 2 are identity partitions, to check the scenario of a multi-level hive partition, for example s3://foo/year=2025/month=06/blah.parquet
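The suggested test could be sketched like this, again with pure-Python stand-ins rather than the real pyarrow/pyiceberg types (all names here are illustrative): three columns, of which two are identity partitions in a hive-style year/month layout.

```python
def test_multi_level_identity_projection() -> None:
    # Hypothetical scenario: s3://foo/year=2025/month=06/blah.parquet
    partition_record = {"year": 2025, "month": 6}
    # Only the data column is physically written to the file.
    file_columns = {"value": [1.5, 2.5]}
    num_rows = 2
    # Project each identity partition value as a constant column,
    # skipping names already present in the file.
    for name, const in partition_record.items():
        if name not in file_columns:
            file_columns[name] = [const] * num_rows
    assert file_columns["year"] == [2025, 2025]
    assert file_columns["month"] == [6, 6]
    assert file_columns["value"] == [1.5, 2.5]


test_multi_level_identity_projection()
```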

Development

Successfully merging this pull request may close these issues.

API table.scan does not conform to Iceberg spec for identity partition columns
3 participants